Dominic Bordelon, Research Data Librarian
University of Pittsburgh Library System
November 8, 2022
Binning (age 54 \(\rightarrow\) 50–60), rounding
Publication of summary statistics and contingency tables
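As a rough illustration of binning and rounding, here is a minimal Python sketch. The helper names `bin_age` and `round_to` and the bin width of 10 are assumptions made for this example; they do not come from any particular library or from the original slides.

```python
def bin_age(age, width=10):
    """Return a label for the half-open bin [low, low + width) containing `age`."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def round_to(value, step):
    """Round `value` to the nearest multiple of `step`."""
    return step * round(value / step)


# Age 54 is reported only as its decade; an income is reported to the nearest 100.
print(bin_age(54))          # 50-60
print(round_to(1234, 100))  # 1200
```

Both helpers discard detail from a single value but leave every record in place, which is worth keeping in mind when considering how these techniques can fall short.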
What are some shortcomings of these techniques?
Source: Sweeney (2015)
Besides being based in relatively new technologies, genomics poses new privacy challenges…
A typical datum in this area is a single-nucleotide polymorphism (SNP), which is a difference in a single nucleotide of genetic code.
SNPs may or may not result in phenotypic differences—and many of them are in non-coding regions of DNA, anyway—but taken altogether, they form a unique “fingerprint.”
SNPs are the raw data; attacks often combine these data and accompanying metadata (e.g., patient’s demographic info).
Wan et al. (2022) give an overview of privacy intrusions and safeguards in genomic data flows.
“Perhaps the most surprising property of differential privacy is that, despite its protective strength, it is compatible with meaningful data analysis.”
This is because:
“data snoopers are not interested in the population, rather, they are interested in this specific realization of data from the population, namely the database itself, \(D\). . . . In a sense, \(D\) is a population, and a data snooper is trying to learn about its fixed unknown parameters”
We can think about the topic of differential privacy in some different ways:
Sequential composition: “the \(\epsilon\)s add up”
\(\epsilon =\) our privacy budget, determined by the curator (the symbol is a lowercase epsilon)
Sequential composition “bounds the total privacy cost of releasing multiple results of differentially private mechanisms on the same input data.” (Near and Abuah 2021)
More queries on a dataset \(\rightarrow\) less privacy, as represented by the budget adding up
Given two mechanisms with budgets \(\epsilon_1 = 1\) and \(\epsilon_2 = 2\), applying them sequentially to the same dataset has a total privacy cost of \(\epsilon_1 + \epsilon_2 = 3\)
Each mechanism enforces its \(\epsilon\) mathematically
Compare with \(k\)-anonymity, which changes multiplicatively—differential privacy is a stronger guarantee
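The bookkeeping behind sequential composition can be sketched as a small budget tracker. The class `PrivacyBudget` and its method `spend` are hypothetical names invented for this illustration; real DP libraries use their own accounting interfaces.

```python
class PrivacyBudget:
    """Hypothetical accountant: tracks epsilon spent under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Record one query's epsilon; refuse it if the total budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon           # the epsilons add up
        return self.total - self.spent  # remaining budget


budget = PrivacyBudget(total_epsilon=3.0)
budget.spend(1.0)  # first mechanism, epsilon = 1
budget.spend(2.0)  # second mechanism, epsilon = 2
print(budget.spent)  # 3.0
```

Once the budget is used up, any further call to `spend` raises an error, which mirrors the curator refusing additional queries on the same data.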
Sequential composition: “the \(\epsilon\)s add up”
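Parallel composition: split the dataset into disjoint chunks and apply a DP mechanism to each chunk; because each individual appears in only one chunk, the total privacy cost is the largest \(\epsilon\) used, not the sum.
Post-processing: if a result is differentially private, any further computation on that result that does not touch the original data is also differentially private, so post-processing cannot weaken the guarantee.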
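Parallel composition and post-processing can be seen together in a noisy histogram. The sketch below is an illustration under stated assumptions: the function names, the fixed age range of 0 to 99, and the use of Laplace noise with scale \(1/\epsilon\) (appropriate when neighboring datasets differ by adding or removing one person) are choices made for this example, not the method of any particular library.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_histogram(ages, epsilon, rng, width=10, max_age=100):
    """Count non-negative ages in fixed, disjoint bins and add Laplace noise to each count."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = {low: 0 for low in range(0, max_age, width)}
    for age in ages:
        low = min(age, max_age - 1) // width * width  # ages above the range go in the top bin
        counts[low] += 1
    # Each person falls in exactly one bin, so by parallel composition the
    # whole histogram costs epsilon, not epsilon times the number of bins.
    return {low: c + laplace_noise(1 / epsilon, rng) for low, c in counts.items()}


def postprocess(noisy):
    """Round and clip noisy counts; this uses only the released values, so it costs nothing."""
    return {low: max(0, round(v)) for low, v in noisy.items()}


rng = random.Random(0)
released = postprocess(noisy_histogram([54, 23, 67, 54, 101], epsilon=1.0, rng=rng))
print(released)
```

Noise is added to every bin in the fixed range, including empty ones, so the set of reported bins does not itself reveal which age groups are present.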
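A mechanism \(\mathcal{A}\) is \(\epsilon\)-differentially private if, for all pairs of neighboring datasets \(D_1\) and \(D_2\) (differing in one individual's record) and all sets of possible outputs \(S\):
\[ \Pr[\mathcal{A}(D_1) \in S] \leq \exp(\epsilon) \cdot \Pr[\mathcal{A}(D_2) \in S] \]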
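To make the inequality concrete, the following sketch checks it numerically for a counting query released with Laplace noise. The counts 10 and 11, the value \(\epsilon = 0.5\), and the function names are assumptions chosen for illustration. The check compares output densities point by point; a density ratio bounded by \(\exp(\epsilon)\) everywhere implies the same bound on the probability of any set \(S\).

```python
import math
import random


def laplace_pdf(x, mu, scale):
    """Density of the Laplace distribution centered at `mu` with the given scale."""
    return math.exp(-abs(x - mu) / scale) / (2 * scale)


def laplace_mechanism(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon (a count has sensitivity 1)."""
    scale = 1 / epsilon
    return true_count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


epsilon = 0.5
scale = 1 / epsilon
count_d1, count_d2 = 10, 11  # neighboring datasets: one person added or removed

# Ratio of output densities under D1 versus D2, on a grid of possible outputs.
grid = [k / 10 for k in range(0, 211)]
max_ratio = max(
    laplace_pdf(x, count_d1, scale) / laplace_pdf(x, count_d2, scale) for x in grid
)

print(max_ratio <= math.exp(epsilon) * (1 + 1e-9))  # True
print(laplace_mechanism(count_d1, epsilon, random.Random(0)))
```

Swapping the roles of the two counts gives the same bound in the other direction, which is why the definition holds for every pair of neighbors.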
A nice figure: https://www.nature.com/articles/s41576-022-00455-y/figures/2
What is sensitive or vulnerable is not always obvious.
Who is the adversary?
When working with data, we should cultivate and maintain a critical awareness of the potential power embedded in our sociotechnical systems—we owe it to our fellow members of society.
Algorithmic Privacy